Language identification

For language identifiers, see Language code.

Language identification is the process of determining which natural language a given piece of content is in. Traditionally, identification of written language (as practiced, for instance, in library science) has relied on manually identifying frequent words and letters known to be characteristic of particular languages. More recently, computational approaches have been applied to the problem by treating language identification as a kind of text categorization, a natural language processing task that relies on statistical methods.

1 Non-Computational Approaches
2 Statistical Approaches
3 See also
4 References
5 External links

Non-Computational Approaches

In the field of library science, language identification is important for categorizing materials. Because librarians often have to categorize materials in languages they are not familiar with, they sometimes rely on tables of frequent words and distinctive letters or characters to help them identify languages. While identifying a single such word or character may not suffice to distinguish a language from another with a similar orthography, identifying several is often highly reliable.

Statistical Approaches

One statistical approach compares the compressibility of the text with the compressibility of texts in a set of known languages: the known language whose sample text helps a compressor encode the unknown text most compactly is taken as the best match. This approach is known as the mutual information based distance measure [1]. The same technique can also be used to empirically construct family trees of languages, which closely correspond to the trees constructed using historical methods.
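The compression comparison described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure of any cited paper: the function names `compressed_size` and `identify_by_compression`, the choice of `zlib` as the compressor, and the "extra bytes" score are all assumptions made for the example.

```python
import zlib
from typing import Mapping


def compressed_size(data: bytes) -> int:
    """Number of bytes zlib needs for `data` at its highest compression level."""
    return len(zlib.compress(data, 9))


def identify_by_compression(text: str, references: Mapping[str, str]) -> str:
    """Return the label whose reference text makes `text` cheapest to encode.

    `references` maps a language label to a sample text in that language
    (hypothetical input format chosen for this sketch).
    """
    if not references:
        raise ValueError("at least one reference text is required")
    encoded = text.encode("utf-8")

    def extra_bytes(label: str) -> int:
        ref = references[label].encode("utf-8")
        # Cost of the unknown text once the compressor has already seen
        # the reference: size(reference + text) - size(reference).
        return compressed_size(ref + encoded) - compressed_size(ref)

    # The smallest extra cost marks the closest reference language.
    return min(references, key=extra_bytes)
```

In practice the reference texts would need to be much longer than a sentence or two, and the score is only meaningful relative to the other candidates: the function always returns one of the supplied labels, even for text in a language that has no reference.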

Another technique, described by Dunning (1994), is to create a language n-gram model from a "training text" for each of the languages. Then, for any piece of text needing to be identified, a similar model is built and compared with each stored language model. The language whose model is most similar to the model of the input text is taken as the most likely language. This approach is problematic when the input text is in a language for which there is no model: the method still returns a "most similar" language, but that result is effectively arbitrary. Another problem is input text composed of several languages, as is common on the Web. For a more recent method, see Řehůřek and Kolkus (2009).
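The train-then-compare idea can be illustrated with a small Python sketch. It is not Dunning's statistical procedure: it uses character trigram counts as the "model" and cosine similarity as the comparison, both chosen here for simplicity, and the names `ngram_profile`, `cosine_similarity`, `train` and `identify` are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Mapping


def ngram_profile(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in lower-cased, space-normalized text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two n-gram count vectors (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def train(training_texts: Mapping[str, str], n: int = 3) -> Dict[str, Counter]:
    """Build one n-gram profile per language from its training text."""
    return {label: ngram_profile(text, n) for label, text in training_texts.items()}


def identify(text: str, models: Mapping[str, Counter], n: int = 3) -> str:
    """Return the label of the stored model most similar to the text's own profile."""
    profile = ngram_profile(text, n)
    return max(models, key=lambda label: cosine_similarity(profile, models[label]))
```

Note that `identify` always returns one of the trained labels, which mirrors the limitation described above: text in a language without a model is still assigned to whichever model happens to score highest. A real system might add a minimum-similarity threshold before accepting the result.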

References

Benedetto, D., E. Caglioti and V. Loreto. Language trees and zipping. Physical Review Letters, 88:4 (2002) [2], [3], [4].

Cilibrasi, Rudi and Paul M.B. Vitanyi. "Clustering by compression". IEEE Transactions on Information Theory 51(4), April 2005, 1523-1545. [5]

Dunning, T. (1994) "Statistical Identification of Language". Technical Report MCCS 94-273, New Mexico State University, 1994.

Goodman, Joshua. (2002) Extended comment on "Language Trees and Zipping". Microsoft Research, Feb 21 2002. (This is a criticism of the data compression approach in favor of the Naive Bayes method.) [6]

Poutsma, Arjen. (2001) Applying Monte Carlo techniques to language identification. SmartHaven, Amsterdam. Presented at CLIN 2001.

The Economist. (2002) "The elements of style: Analysing compressed data leads to impressive results in linguistics". [7]

Survey of the State of the Art in Human Language Technology (1996), section 8.7, "Automatic Language Identification". [8]

Radim Řehůřek and Milan Kolkus. (2009) Language Identification on the Web: Extending the Dictionary Method [9]

External links

Language Identification Tools: list of links by Gertjan van Noord, with number of languages, brief description and license information.

LID - Language Identification in Python: algorithm and code example of an n-gram based LID tool in Python and Scheme by Damir Cavar.

AlchemyAPI: language identification API, available as an SDK and through a RESTful API (web-based demonstration).

PetaMem Language Identification: provides a choice between ngram, nvect and smart methods.

Open Xerox LanguageIdentifier, available in web-based form or through API.

What Language Is This? Online language identifier: web-based tool written by Henrik Falck.

Rosette Language Identifier: product by Basis Technology.

Language Identifier: product by Sematext; exposes Java API and is available through REST/Webservice.

G2LI (Global Information Infrastructure Laboratory's Language Identifier).

lid Language Identifier: by Lingua-Systems; C/C++ library and Perl Extension (online demo).

language-detection: open-source language detection library for Java (Apache License 2.0).

lc4j, a language categorization Java library, by Marco Olivo.

S.M.Mohammadzadeh: Language identification/detection related documents (26 February 2011).

Microsoft Extended Linguistic Services for Windows 7: including Microsoft Language Detection.

Windows 7 API Code Pack for .NET: including managed interfaces for the above.
